import numpy as np
import pandas as pd
import seaborn as sns
import pickle
import dalex as dx
# libraries that are used in creating objects from pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
# code necessary to load the model from pickle
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
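As a quick sanity check of the derived-feature logic above, the same arithmetic can be run on a tiny hand-made array (the values are illustrative; the column layout just has to match the indices, with rooms at position 3, bedrooms at 4, population at 5, and households at 6):

```python
import numpy as np

# Hypothetical 1-row input matching the column indices assumed above.
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
X = np.array([[0., 0., 0., 100., 20., 300., 50.]])

rooms_per_household = X[:, rooms_ix] / X[:, households_ix]            # 100/50 = 2.0
population_per_household = X[:, population_ix] / X[:, households_ix]  # 300/50 = 6.0
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]                # 20/100 = 0.2

# np.c_ appends the three derived columns, as in transform()
X_new = np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
print(X_new.shape)    # (1, 10)
print(X_new[0, -3:])  # [2.  6.  0.2]
```

The transformer keeps the original columns and appends up to three ratio features, so a 7-column input becomes a 10-column output.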
with open('full_model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('test_dataset.pkl', 'rb') as f:
    test_data = pickle.load(f)
X = test_data.drop(columns=['median_house_value'])
y = test_data['median_house_value']
Please note that the training data is not provided here, as the model was trained in a different notebook to avoid redundant code.
model_exp = dx.Explainer(model, X, y,
                         label="housing RF Pipeline")
Preparation of a new explainer is initiated
  -> data              : 4128 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4128 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : housing RF Pipeline
  -> predict function  : <function yhat_default at 0x000001B4602FC160> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 4.95e+04, mean = 2.08e+05, max = 5e+05
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.49e+05, mean = -1.66e+03, max = 2.94e+05
  -> model_info        : package sklearn

A new explainer has been created!
observation_1 = X.iloc[[5]]
observation_2 = X.iloc[[321]]
prediction_1 = model.predict(observation_1)
prediction_2 = model.predict(observation_2)
print("Real value of observation 1: {y_1:.0f}; predicted value: {y_1_hat:.0f}".format(y_1=list(y.iloc[[5]])[0], y_1_hat=prediction_1[0]))
print("Real value of observation 2: {y_2:.0f}; predicted value: {y_2_hat:.0f}".format(y_2=list(y.iloc[[321]])[0], y_2_hat=prediction_2[0]))
Real value of observation 1: 120600; predicted value: 137187
Real value of observation 2: 298900; predicted value: 275727
The model predicts the target reasonably well, so we can proceed with the explanations.
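To put a number on "reasonably well" for these two observations, we can compute the relative errors directly from the printed real/predicted values above (copied from the output, not recomputed from the model):

```python
# Real and predicted values copied from the printed output above.
pairs = {
    "observation 1": (120600, 137187),
    "observation 2": (298900, 275727),
}
rel_err = {name: abs(pred - real) / real for name, (real, pred) in pairs.items()}
for name, e in rel_err.items():
    print(f"{name}: relative error = {e:.1%}")
# observation 1: relative error = 13.8%
# observation 2: relative error = 7.8%
```

Both predictions are within roughly 14% of the true value, which supports moving on to the local explanations.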
order = X.columns.to_list()
# first observation
model_exp.predict_parts(observation_1,
                        type='break_down',
                        order=order).plot()
model_exp.predict_parts(observation_1,
                        type='shap').plot()
Based on the break_down plot, the greatest positive impact on the prediction comes from total_rooms and longitude. However, longitude carries little information without latitude, so we may expect that only the interaction of these two variables really matters in the model.
The greatest negative impact comes from ocean_proximity and latitude. Again, latitude creates significant information only in interaction.
On the shap plot, the impact of households is positive, while on the break_down plot it is negative. This may suggest that interactions between households and other variables exist. We may expect interactions with the total_* or population variables, as ratios of these variables tell more about housing in the area than the single variables do.
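The disagreement between break_down and shap is itself a signal: break_down conditions on features in one fixed order, while SHAP averages the contributions over orderings, so the two agree only when features act additively. A minimal toy illustration (not the housing pipeline) of this order dependence:

```python
# Toy model with a pure interaction: neither feature matters alone.
def f(a, b):
    return a * b

# Baseline (0, 0), explained instance (1, 1).
# break_down with order (a, b): set a first, then b.
contrib_a_first = f(1, 0) - f(0, 0)  # 0 -> a looks unimportant
contrib_b_after = f(1, 1) - f(1, 0)  # 1 -> b gets all the credit

# break_down with order (b, a): the attributions flip.
contrib_b_first = f(0, 1) - f(0, 0)  # 0
contrib_a_after = f(1, 1) - f(0, 1)  # 1

# SHAP averages over both orders -> each feature gets 0.5.
shap_a = (contrib_a_first + contrib_a_after) / 2
shap_b = (contrib_b_first + contrib_b_after) / 2
print(shap_a, shap_b)  # 0.5 0.5
```

When the fixed-order break_down attribution for a feature (here 0 or 1, depending on the order) differs from its SHAP value (0.5), the gap points at an interaction, which is exactly why a sign flip for households between the two plots is suspicious.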
# second observation
model_exp.predict_parts(observation_2,
                        type='break_down',
                        order=order).plot()
model_exp.predict_parts(observation_2,
                        type='shap').plot()
While in the first observation most variables have a negative impact, in the second observation they have a mainly positive impact. It is worth mentioning that the greatest negative impact in the second observation comes from households and total_rooms, while these two have the greatest positive impact in the first observation. A positive linear correlation between the target and these two variables should be examined.
Since the values obtained by break_down and shap differ, some interactions exist in the model. We can easily find them using break_down_interactions plots:
model_exp.predict_parts(observation_1,
                        type='break_down_interactions',
                        interaction_preference=1).plot()
model_exp.predict_parts(observation_2,
                        type='break_down_interactions',
                        interaction_preference=1).plot()
For each observation the interactions are different; however, we can notice that:
- latitude and longitude interact with each other;
- total_bedrooms and total_rooms interact - we may expect that their ratio carries information about the average size of the houses in the area.
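One way to probe the ratio hypothesis would be to add it as an explicit feature and see whether it absorbs the interaction. A sketch on a tiny made-up frame (column names match the housing data; the values are illustrative, not real districts):

```python
import pandas as pd

# Hypothetical districts with the two interacting columns.
df = pd.DataFrame({
    "total_rooms":    [1000.0, 2000.0, 1500.0],
    "total_bedrooms": [ 200.0,  300.0,  450.0],
})

# Derived ratio: bedrooms per room, a proxy for typical house size in the area.
df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
print(df["bedrooms_per_room"].tolist())  # [0.2, 0.15, 0.3]
```

Note that the pipeline's CombinedAttributesAdder already computes this ratio when add_bedrooms_per_room=True, which is consistent with the interaction showing up in the explanations.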